Usernetes Gen2: depends on Rootless Docker on hosts #287
Conversation
Okay, so it sounds like I should try to remove the shared filesystem and get rootless working? Is there a way to change that path so I can throw it somewhere else?
I guess you can just make a symlink, or …
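If the rootless Docker data directory lives on the shared filesystem, relocating it could look like the sketch below. This is an assumption-laden example: `/scratch/$USER` is a hypothetical local path, and the `data-root` approach relies on rootless Docker reading `~/.config/docker/daemon.json`.

```shell
# Option 1: symlink the rootless Docker data dir to local storage
# (/scratch/$USER is a hypothetical local path; adjust for your hosts)
systemctl --user stop docker
mkdir -p /scratch/$USER
mv ~/.local/share/docker /scratch/$USER/docker
ln -s /scratch/$USER/docker ~/.local/share/docker
systemctl --user start docker

# Option 2: point the daemon's data-root there explicitly instead
mkdir -p ~/.config/docker
cat > ~/.config/docker/daemon.json <<EOF
{ "data-root": "/scratch/$USER/docker" }
EOF
systemctl --user restart docker
```

Either way, `docker info | grep "Docker Root Dir"` should confirm the new location afterwards.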
okay, making progress! I made the nodes isolated for now - we can try the above later. I was able to get rootless Docker installed and the control plane and nodes up. I'm trying to run the hack test now, and there is an error with the shell. When I Ctrl-C:
It looks like the entrypoint is doing wget to the others, so I can try that manually. Ah, there is a timeout:
Update: the same timeout happens with …
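A manual version of the check the entrypoint is doing might look like this. The IP is the one mentioned later in this thread, and the port is an assumption (6443 is the kubeadm/apiserver port used in the join command below); substitute whatever the entrypoint actually probes.

```shell
# Manually reproduce the entrypoint's wget probe between hosts
# (HOST_IP/PORT are examples; adjust to match the entrypoint)
HOST_IP=10.10.0.5
PORT=6443
wget --timeout=5 --tries=1 -qO- "http://${HOST_IP}:${PORT}" \
  && echo "reachable" \
  || echo "timed out or refused"
```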
Maybe related? https://gitlab.freedesktop.org/dbus/dbus/-/issues/374
Unlikely.
Also, please make sure "10.10.0.5" is the IP of the host (not the node container) that is reachable from other hosts.
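A quick way to verify that distinction: list the host's own addresses, then probe from a different host. The interface name and port below are assumptions for illustration.

```shell
# On the host itself (not inside a node container):
ip -4 addr show          # 10.10.0.5 should appear on a host interface

# From another host, confirm reachability:
ping -c 3 10.10.0.5
nc -vz -w 5 10.10.0.5 6443   # assumed apiserver port; adjust as needed
```

If the IP only exists inside the node container's network namespace, the probes from other hosts will fail even though local checks pass.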
This has happened twice now - it freezes when the worker node is connecting:
I'm just going to Ctrl-C and continue with one worker node for now.
actually, I take it back - it's not working on either worker node now. This step hangs:
Even when I run the linger command and daemon-reload, that message comes up.
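For reference, the linger/daemon-reload sequence being described is presumably something like the following; the key pitfall is reloading the user manager rather than the system one.

```shell
# Enable lingering so the user's systemd services survive logout
sudo loginctl enable-linger "$USER"
loginctl show-user "$USER" -p Linger   # expect: Linger=yes

# Reload the *user* manager, not the system one
systemctl --user daemon-reload
systemctl --user status docker
```

If `Linger=yes` is reported and the user services still die, checking `XDG_RUNTIME_DIR` (it should be `/run/user/$(id -u)`) is another common thing to rule out.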
The token value shouldn't be pasted publicly.
Seems like a networking issue. Is …
Weird - I'm getting the error earlier now (I haven't joined the worker nodes yet; this is from the control plane):
I don't think so - this is from the same host:
And here is from 002:
No route to host. That's so weird - this just worked on the previous cluster I brought up (with no differences).
I'm going to tear it down and bring it up again from scratch.
okay, this time I am trying a larger node (just to sanity check), and the pre-flight check failed for the first node:
And for the second node it's still hanging.

@AkihiroSuda I think you are 15 hours ahead of me, so 4pm my time is 7am your time, and 5pm my time is 8am (start of the work day?). We were planning on doing a small hackathon this Friday to work on this, and I wanted to invite you / see if you are available. 7am is quite early, but if you are up around 8am I think I could work a bit later on Friday. I could potentially do an hour later; I just need some notice for that!

High level: I'd like to get this terraform setup working, and working consistently, so I can contribute it here. I am going to try one more thing tonight - bringing it totally down and up, and running the scripts interactively. If there is some subtle difference with a service not persisting in this automated mode, that might explain it. I will update the thread here; let me know if you might have some time on Friday so we can bring up this setup and get your eyes on it (I am likely missing something obvious, and this is very likely the best means of finishing it up!)
Reproduced in manual running mode, so it's unlikely to be the automation bit.

✔ Network usernetes_default Created 0.1s
✔ Volume "usernetes_node-opt" Created 0.0s
✔ Volume "usernetes_node-etc" Created 0.0s
✔ Volume "usernetes_node-var" Created 0.0s
✔ Container usernetes-node-1 Started 4.9s
docker compose exec -e U7S_HOST_IP=10.10.0.4 -e U7S_NODE_NAME=u7s-usernetes-compute-002 -e U7S_NODE_SUBNET=10.100.153.0/24 node kubeadm join 10.10.0.3:6443 --token boydm6.lgdgji6o10zhcrww --discovery-token-ca-cert-hash sha256:60006cde0edda31f26cae0f2a80ef7fac7803d1121ab98678fa81edc220c212a
[preflight] Running pre-flight checks
[WARNING SystemVerification]: missing optional cgroups: hugetlb
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR CRI]: container runtime is not running: output: time="2023-09-06T04:33:28Z" level=fatal msg="validate service connection: validate CRI v1 runtime API for endpoint \"unix:///var/run/containerd/containerd.sock\": rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/containerd/containerd.sock: connect: no such file or directory\""
, error: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
make: *** [Makefile:97: kubeadm-join] Error 1
make: Leaving directory '/opt/usernetes'
make: Entering directory '/opt/usernetes'
./Makefile.d/check-preflight.sh

I can confirm this works on the main control plane, so it's definitely just not being able to reach that port.
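The preflight failure above has two separate symptoms worth checking independently: containerd's socket missing inside the node container, and the apiserver port being unreachable from the worker host. A hedged diagnostic sketch, reusing the service name and endpoint from the compose output and kubeadm join line above:

```shell
# Is containerd actually up inside the failing node container?
docker compose exec node ps aux | grep -v grep | grep containerd
docker compose exec node ls -l /var/run/containerd/containerd.sock

# From the worker host: can we reach the control plane's apiserver?
# (10.10.0.3:6443 is the endpoint from the kubeadm join command above)
nc -vz -w 5 10.10.0.3 6443
```

If the socket is missing, the problem is inside the node container's startup; if `nc` fails, it's host-to-host networking (firewall/egress), which matches the "no route to host" seen earlier.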
I'm going to try adding egress for that port. It doesn't make sense that it worked the first time, but it's worth a shot!
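For reference, opening that port in GCP might look like the following. This is a sketch with hypothetical rule and network names; the 10.10.0.0/24 range is inferred from the host IPs in this thread and should be replaced with the actual VPC subnet.

```shell
# Hypothetical GCP firewall rules for the kubeadm/apiserver port (6443)
gcloud compute firewall-rules create u7s-apiserver-ingress \
  --network=NETWORK_NAME --direction=INGRESS \
  --allow=tcp:6443 --source-ranges=10.10.0.0/24

gcloud compute firewall-rules create u7s-apiserver-egress \
  --network=NETWORK_NAME --direction=EGRESS \
  --allow=tcp:6443 --destination-ranges=10.10.0.0/24
```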
Nice! So the nodes (one worker node) are coming up again. So I think it was egress, but I can't say why it worked the first time! There is still some flakiness with something related to the actual instance and cgroups; I've seen this a couple of times (usually just one node - it's like one of the nodes randomly starts without support for the updated cgroups, and reports missing systemd).
I would suspect this is Google Cloud or terraform related, not usernetes, but I don't know. But now that I know egress was an issue, and we had an issue with the ports for the test app, I'm going to blow it up again and expose more for egress. Will send an update!
okay, reproduced what I had earlier - it seems a bit flaky (not the usernetes, the terraform), but this did work a second time. The place we are at is that the nodes come up, but the test doesn't work.
Let me know if you might be able to join Friday! If not, we can keep going back and forth here. The next thing to figure out is why I can't shell into / connect to a pod.
Perhaps if there is some range of IPs that needs to be open for the pods, I should try adding them to egress. Adding the entire range seemed to bork the fix for 6443.
I'm off to bed - thanks for the help today @AkihiroSuda!
👍
VXLAN doesn't seem to work with Google Cloud by default, although it works with AWS and Azure. Likely related to MTU.
I'm going to ask if there are easy ways to get VXLAN working in GCP - ping @aojea. If not, I can prepare an equivalent setup on AWS. I have one for AWS with Flux, and I'd need to start that over to use a different Ubuntu base, remove Flux, etc.: https://github.com/converged-computing/flux-terraform-ami
VXLAN works; if there is an MTU problem, it is most probably solved by reducing the MTU on the origin, or increasing it in the network (VM) so the encapsulation goes through: https://cloud.google.com/vpc/docs/mtu
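Concretely, GCP VPCs default to an MTU of 1460, and VXLAN encapsulation adds roughly 50 bytes of overhead, so oversized inner packets can get silently dropped. A sketch of how to check and work around this (interface and network names are assumptions):

```shell
# Check the VM interface MTU (ens4 is a common GCE interface name)
ip link show ens4

# Test what payload size passes between hosts without fragmentation
# (-M do sets the Don't Fragment bit; shrink -s until it succeeds)
ping -M do -s 1400 -c 3 10.10.0.4

# One workaround: raise the VPC MTU so VXLAN overhead still fits
gcloud compute networks update NETWORK_NAME --mtu=1500
```

The alternative direction, per the GCP docs linked above, is lowering the MTU used by the overlay so encapsulated packets fit inside the VPC's existing MTU.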
@vsoch Are you still planning something today? (8:22 AM Friday here)
@AkihiroSuda my mistake in mixing up my reference days - it's still Thursday here! So our hackathon would be tomorrow at 3pm Mountain time in the US (it looks like that's about 21.5 hours from now). And we have two things we can look at: first is the usernetes setup here, and second is an AWS equivalent I've started, although we are still in the early steps (e.g., ensuring each node knows the hostnames of the others).
Sorry, I'm not attending then, but happy to help with your experiment with AWS.
no worries! I can give you an update then. I can tell you that I can't consistently get the GCP setup working, maybe because of networking stuff. It worked once, but then not again, even when I upped the MTU. I'm hoping we just have more luck on AWS and can develop there - will give you an update!
And @AkihiroSuda, next time we will make sure to plan one on our Thursday, which I am realizing is your Friday morning. Apologies for the oversight!
(Like kind and minikube KIC, but for multi-host; see #286)

Usernetes (Gen2) deploys a Kubernetes cluster on Rootless Docker hosts. Usernetes (Gen2) is similar to Rootless kind and Rootless minikube, but Usernetes (Gen2) supports creating a cluster with multiple hosts.
Components
Requirements
- Rootless Docker
- cgroup v2 delegation

Using Ubuntu 22.04 hosts is recommended.
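For the cgroup v2 delegation requirement, the commonly documented approach for rootless container runtimes is a systemd drop-in for `user@.service`; a sketch (the exact controller list to delegate may differ from what usernetes requires):

```shell
# Delegate cpu/cpuset/io/memory/pids controllers to user sessions (cgroup v2)
sudo mkdir -p /etc/systemd/system/user@.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/user@.service.d/delegate.conf
[Service]
Delegate=cpu cpuset io memory pids
EOF
sudo systemctl daemon-reload

# Verify delegation took effect for your user
cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers
```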
Usage
See `make help`.